Assignment 3 (Section 21 & 22)

Instructions

  1. You may talk to a friend and discuss the questions and potential directions for solving them. However, you must write your own solutions and code individually, not as a group activity.

  2. Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solutions are written neatly enough to understand and grade.

  3. Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on Tuesday, 20th February 2024 at 11:59 pm.

  5. Five points are allocated to properly formatting the assignment. The breakdown is as follows:

    • Must be an HTML file rendered using Quarto (1 point). If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the .ipynb file.
    • No name or other indicator of the student's identity (e.g., a printout of the working directory) may appear anywhere in the final submission. (1 point)
    • There are no excessively long outputs of extraneous information (e.g., printouts of entire data frames without good reason, long printouts of which iteration a loop is on, or long sections of commented-out code). (1 point)
    • Final answers to each question are written in the Markdown cells. (1 point)
    • There is no unnecessary or redundant code, and no unnecessary or redundant text. (1 point)

Data description

The data is related to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls, in which bank clients were called and asked to subscribe to a term deposit.

There is one training dataset, train.csv, which you will use to develop a model, and two test datasets, test1.csv and test2.csv, which you will use to test your model. Each observation is a phone call, and each column is a variable about the client or the phone call. Each dataset has the following attributes about the clients called in the marketing campaign:

  1. age: Age of the client

  2. education: Education level of the client

  3. day: Day of the month the call is made

  4. month: Month of the call

  5. y: Did the client subscribe to a term deposit?

  6. duration: Call duration, in seconds. This attribute highly affects the output target (e.g., if duration = 0 then y = 'no'). However, the duration is not known before a call is made, and once the call ends, y is obviously known. Thus, this input should only be included for inference purposes and should be discarded if the intention is to have a realistic predictive model.

(Source: UCI Data Archive. Please use the given datasets for the assignment, not the raw data from the source. It is just for reference.)

Instructions / suggestions for answering questions

  1. Instruction: Use train.csv for all questions, unless otherwise stated.

  2. Suggestion 1: You may use the functions in the class notes for printing the confusion matrix and the overall classification accuracy based on test / train data.

  3. Suggestion 2: If you make variable transformations, you will need to apply them to all three datasets. Your code will be more concise if you write a function containing all the transformations and then call it for the training and the two test datasets. You can put this function at the beginning of your code and keep adding transformations to it as you proceed with the assignment. You may need transformations in questions (1) and (13). A minimal sketch of this approach is shown below.
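
For instance, a minimal sketch of such a function is given below. The transformation shown (converting duration from seconds to minutes) is purely illustrative; add whatever transformations your analysis actually requires. It assumes the three CSV files are in the working directory.

Code
import pandas as pd

def transform(data):
    # Apply the same transformations to every dataset.
    # The transformation below is only a placeholder / illustration.
    data = data.copy()
    data['duration_min'] = data['duration'] / 60  # call duration in minutes
    return data

train = transform(pd.read_csv('train.csv'))
test1 = transform(pd.read_csv('test1.csv'))
test2 = transform(pd.read_csv('test2.csv'))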

1) Visualization

Read the datasets. Make an appropriate visualization showing how the proportion of clients subscribing to a term deposit changes with increasing call duration.

(4 points)

Hints:

  1. Bin duration to create duration_binned. Group the data to find the fraction of clients responding positively to the marketing campaign for each bin in duration_binned. Make a lineplot of the percentage of clients subscribing to a term deposit vs duration_binned, where the bins in duration_binned are arranged in increasing order of duration (see the sketch after these hints).

  2. You may choose an appropriate number of bins & type of binning that helps you visualize well.

  3. You may also think of other ways to visualize this; you don't need to stick with this one.
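
A minimal sketch of the binning approach in hint (1) is given below, assuming the training data is in a DataFrame named train with columns duration and y (coded 'yes' / 'no'); the choice of 10 quantile bins is only an example.

Code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Bin call duration (number and type of bins are a choice; 10 quantile bins shown)
train['duration_binned'] = pd.qcut(train['duration'], q=10)

# Percentage of positive responses within each duration bin
prop = (train.groupby('duration_binned', observed=True)['y']
             .apply(lambda s: (s == 'yes').mean() * 100)
             .reset_index(name='pct_subscribed'))
prop['bin'] = prop['duration_binned'].astype(str)

# Bins are already ordered by increasing duration
sns.lineplot(data=prop, x='bin', y='pct_subscribed', marker='o')
plt.xticks(rotation=45)
plt.xlabel('Call duration (binned)')
plt.ylabel('% subscribing to term deposit')
plt.show()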

2) Predictor duration

Based on the plot in (1), comment on whether duration seems to be a useful variable for predicting if the client will subscribe to a term deposit.

(1 point)

3) Model based on duration

Develop a logistic regression model to predict if the client subscribed to a term deposit based on call duration. Use the model to make a lineplot showing the probability of the client subscribing to a term deposit as a function of call duration.

(3 points)

Note

Answer questions 4 to 11 based on the regression model developed in (3).

4) Model significance

Is the regression model in (3) statistically significant? Justify your answer.

(1 point for code, 1 point for answer)

5) Subscription probability in 5 minutes

What is the probability that the client subscribes to a term deposit with a 5-minute marketing call? Note that the call duration in the data is given in seconds.

(2 points)

6) Call duration for subscription

What is the minimum call duration (in minutes) for which a client has a 95% or higher chance of subscribing to a term deposit?

(3 points)

7) Maximum call duration

What is the maximum call duration (in minutes) for which a client refused to subscribe to a term deposit? What was the probability of the client subscribing to the term deposit in that call?

(3 points)

8) Percent increase in odds

What is the percentage increase in the odds of a client subscribing to a term deposit when the call duration increases by a minute?

(3 points)

9) Doubling the subscription odds

How much must the call duration increase (in minutes) so that it doubles the odds of the client subscribing to a term deposit?

(3 points)

10) Classification accuracy

What is the minimum overall classification accuracy of the model among the classification accuracies on train.csv, test1.csv and test2.csv? Consider a decision threshold probability of 30% when classifying observations.

(2 + 1 + 1 points)

11) Recall

What is the minimum Recall of the model among the Recall performance on train.csv, test1.csv and test2.csv? Consider a decision threshold probability of 30% when classifying observations.

Here, Recall is the proportion of clients predicted to subscribe to a term deposit among those who actually subscribed.
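
For reference, a minimal sketch of computing recall directly, assuming y_true and y_pred are 0/1 arrays of the actual and predicted classes:

Code
import numpy as np

def compute_recall(y_true, y_pred):
    # Recall = TP / (TP + FN): fraction predicted positive among actual positives
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn)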

(3 points)

12) Subscription probability based on age and education

Develop a logistic regression model to predict the probability of a client subscribing to a term deposit based on age, education and the two-factor interaction between age and education. Based on the model, answer:

  1. People with which type of education (primary / secondary / tertiary / unknown) have the highest percentage increase in odds of subscribing to a term deposit with a unit increase in age? Justify your answer.

  2. What is the percentage increase in odds of a person subscribing to a term deposit for a unit increase in age, if the person has tertiary education?

  3. What is the percentage increase in odds of a person subscribing to a term deposit for a unit increase in age, if the person has primary education?

(1 point for developing the model, 3 points for (a), 3 points for (b), 3 points for (c))

13) Model development

Develop a logistic regression model (using train.csv) to predict the probability of a client subscribing to a term deposit based on age, education, day and month. The model must have:

  1. Minimum overall classification accuracy of 75% among the classification accuracies on train.csv, test1.csv and test2.csv.

  2. Minimum recall of 50% among the recall performance on train.csv, test1.csv and test2.csv.

For all three datasets - train.csv, test1.csv, and test2.csv - print the:

  1. Model summary (only for train.csv),

  2. Confusion matrices,

  3. Overall classification accuracies, and

  4. Recall

Note that:

  1. You cannot use duration as a predictor because its value is determined only after the marketing call ends, at which point the client's response is already known. That is why we used duration only for inference in the previous questions: it helped us understand the effect of the length of the call on marketing success.

  2. It is possible to develop a model satisfying constraints (a) and (b) with just appropriate transformation(s) of the predictor(s). However, you may consider interactions if you wish. Justify the transformations, if any, with visualizations.

  3. You are free to choose any value of the decision threshold probability for classifying observations. However, you must use the same threshold on all the three datasets.

(10 points)

14) ROC-AUC

Report the probability that the model will predict a higher probability of response for a customer who signs up for the term deposit than for a customer who does not sign up, i.e., the ROC-AUC of the model developed in (13).

Hint: Use the functions roc_curve and auc from the sklearn.metrics module.
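
A minimal sketch of this hint, assuming y_train holds the actual responses coded 0/1 and pred_prob holds the predicted probabilities from the model in (13):

Code
from sklearn.metrics import roc_curve, auc

# fpr and tpr are computed at every candidate decision threshold probability
fpr, tpr, thresholds = roc_curve(y_train, pred_prob)
print('ROC-AUC =', auc(fpr, tpr))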

(3 points)

15) Net profit

Suppose that the model developed in (13) is used to predict which clients in test1.csv and test2.csv will respond positively to the campaign. Only those clients who are predicted to respond positively are called during the marketing campaign. Assume that:

  1. A profit of \$100 is associated with a client who responds positively to the campaign,

  2. A loss of \$10 is associated with a client who responds negatively to the campaign.

What is the net profit from the campaign? Use the confusion matrices printed in (13).

(4 points)

16) Decision threshold probability

Based on the profit and loss associated with client responses specified in (15), and the model developed in (13), find the decision threshold probability of classification such that the net profit is maximized. Use train.csv.

Proceed as follows:

  1. You would have obtained FPR and TPR for all potential decision threshold probabilities in (14).

  2. Formulate an expression quantifying the net profit per client, in terms of FPR, TPR, and the overall response rate, i.e., the proportion of clients actually subscribing to the term deposit (a sketch of one possible formulation follows this list).

  3. Find the decision threshold probability that maximizes the expression in step (2).
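
A sketch of one possible formulation of steps (2) and (3) is given below, assuming fpr, tpr, and thresholds were obtained from roc_curve in (14) and that y in train.csv is coded 'yes' / 'no'. Verify that this expression matches your own reasoning before using it.

Code
import numpy as np

# Overall response rate: proportion of clients who actually subscribed
p = (train['y'] == 'yes').mean()

# Net profit per client at each threshold: $100 for every true positive called,
# -$10 for every false positive called (one possible formulation)
net_profit_per_client = 100 * p * tpr - 10 * (1 - p) * fpr

best = np.argmax(net_profit_per_client)
print('Best decision threshold probability:', thresholds[best])
print('Maximum net profit per client:', net_profit_per_client[best])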

(5 points)

17) Net profit based on new decision threshold probability

Using the new decision threshold probability obtained in (16), answer (15), i.e., what is the net profit associated with the clients in test1.csv and test2.csv if a marketing campaign is performed? Again, only those clients who are predicted to respond positively, based on the new decision threshold probability, are called during the marketing campaign.

Also, print the confusion matrices for predictions on test1.csv and test2.csv with the new threshold probability.

(4 points)

18) Model preference

Was the classification accuracy of the model in (13) higher than that of the model in (17)? If yes, then should you prefer the model in (13) for the marketing campaign? Why or why not?

Note: The model in (17) is the same as in (13), except with a different decision threshold probability.

(3 points)

19) ROC curve

Plot the ROC curve for the model developed in (13). Mark the point on the curve corresponding to the decision threshold probability identified in (16).

Note that the ROC curve is independent of the decision threshold probability used by the model for prediction.

(3 points)

20) Profit with TPR / FPR

Make a scatterplot of TPR vs FPR, and color the points based on net profit per client.

You can use the following code to make the plot if you have the relevant metrics in tpr, fpr, and net_profit.

(1 point)

Code
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (9,6)
plt.rcParams["figure.autolayout"] = True
f, ax = plt.subplots()
points = ax.scatter(fpr, tpr, c = net_profit, s=50, cmap="Blues")
f.colorbar(points, label = "Net profit ($) \n(per client)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()

21) Precision-recall

Compare the precision and recall of the models in (13) and (17) on train.csv.

Note: The model in (17) is the same as in (13), except with a different decision threshold probability.

(4 points)

22) Precision-recall: important metric

Based on the above comparison, which metric among precision and recall turns out to be more important for maximizing the net profit in the marketing campaign?

(1 point)

23) Precision-recall curve

Plot precision and recall against the decision threshold probability for the model developed in (13). Mark the points on the curves corresponding to the decision threshold probability identified in (16).
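
A minimal sketch using precision_recall_curve from sklearn.metrics, assuming y_train and pred_prob are as in (14):

Code
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_train, pred_prob)

# precision and recall have one more element than thresholds; drop the last point
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.xlabel('Decision threshold probability')
plt.ylabel('Precision / Recall')
plt.legend()
plt.show()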

(3 points)

24) Precision-recall vs FPR-TPR

Instead of using the FPR and TPR metrics to find the optimum decision threshold probability in (16), use the precision-recall metrics to find the same.

(5 points)

25) Sklearn

Using train.csv and only sklearn, pandas, and numpy, train a Logistic Regression model. You need the following steps:

  • The response is still y.
  • Predictors are education, month, day and age.
  • Numerical predictors need to be transformed to all their second-order polynomial versions.
  • Categorical predictors need to be one-hot-encoded. They should not interact with the numerical predictors.
  • Afterwards, all the predictors need to be standard scaled.

Print the accuracy and recall for both the training and test data using a threshold of 0.11. Use test1.csv as the test dataset. Remember that the test dataset needs to go through the exact same transformation pipeline as the training dataset. A sketch of one possible pipeline is shown below.
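
A minimal sketch of one way to set up such a pipeline with a ColumnTransformer is given below. Treating age and day as the numerical predictors, coding y as 1 for 'yes', and the LogisticRegression settings are assumptions; adapt them to your data.

Code
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

train = pd.read_csv('train.csv')
test1 = pd.read_csv('test1.csv')
predictors = ['education', 'month', 'day', 'age']
X_train, y_train = train[predictors], (train['y'] == 'yes').astype(int)
X_test, y_test = test1[predictors], (test1['y'] == 'yes').astype(int)

# Numerical predictors -> all second-order polynomial terms; categorical -> one-hot
# (sparse_threshold=0 forces a dense output so StandardScaler can center the data)
preprocess = ColumnTransformer(
    [('num', PolynomialFeatures(degree=2, include_bias=False), ['day', 'age']),
     ('cat', OneHotEncoder(handle_unknown='ignore'), ['education', 'month'])],
    sparse_threshold=0,
)

pipe = Pipeline([
    ('preprocess', preprocess),
    ('scale', StandardScaler()),          # scale all predictors after encoding
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Classify with a 0.11 threshold and report accuracy and recall on both datasets
for name, X, y in [('train', X_train, y_train), ('test1', X_test, y_test)]:
    pred = (pipe.predict_proba(X)[:, 1] >= 0.11).astype(int)
    print(name, 'accuracy =', round(accuracy_score(y, pred), 3),
          'recall =', round(recall_score(y, pred), 3))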

(8 points)